Lab Test
Evaluating Prompting Strategies with MedGemma for Medical Order Extraction
Balachandran, Abhinand, Durgapraveen, Bavana, Sudhagar, Gowsikkan Sikkan, S, Vidhya Varshany J, Rajkumar, Sriram
The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team's submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to "overthinking" and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction under varied data conditions.
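The contrast between the two main strategies can be sketched as prompt templates. This is a hypothetical illustration: the order schema, wording, and `build_prompt` helper are assumptions, not the authors' actual templates.

```python
# Hypothetical prompt templates contrasting one-shot and ReAct-style prompting
# for medical order extraction. The schema and wording are invented for illustration.

ONE_SHOT_TEMPLATE = """You are a clinical scribe. Extract medical orders as JSON.

Example transcript:
Doctor: Let's get a CBC today and start lisinopril 10 mg daily.
Orders: [{{"type": "lab", "name": "CBC"}},
         {{"type": "medication", "name": "lisinopril", "dose": "10 mg daily"}}]

Transcript:
{transcript}
Orders:"""

REACT_TEMPLATE = """You are a clinical scribe. Think step by step, then act.

Thought: identify every order the doctor commits to in the transcript.
Action: extract_orders
Observation: {transcript}
Thought: now list each order with its type and details.
Final Answer (JSON):"""

def build_prompt(transcript: str, strategy: str = "one_shot") -> str:
    """Return the prompt for the chosen strategy ('one_shot' or 'react')."""
    template = ONE_SHOT_TEMPLATE if strategy == "one_shot" else REACT_TEMPLATE
    return template.format(transcript=transcript)
```

The one-shot variant asks for the answer directly with a single worked example, while the ReAct variant interleaves reasoning and action steps; the paper's finding is that the extra reasoning scaffolding did not pay off on this task.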
- Health & Medicine > Therapeutic Area (0.70)
- Health & Medicine > Diagnostic Medicine > Lab Test (0.47)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.47)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.35)
MACD: Multi-Agent Clinical Diagnosis with Self-Learned Knowledge for LLM
Li, Wenliang, Yan, Rui, Zhang, Xu, Chen, Li, Zhu, Hongji, Zhao, Jing, Li, Junjun, Li, Mengru, Cao, Wei, Jiang, Zihang, Wei, Wei, Zhang, Kun, Zhou, Shaohua Kevin
Large language models (LLMs) have demonstrated notable potential in medical applications, yet they face substantial challenges in handling complex real-world clinical diagnoses using conventional prompting methods. Current prompt engineering and multi-agent approaches typically optimize isolated inferences, neglecting the accumulation of reusable clinical experience. To address this, this study proposes a novel Multi-Agent Clinical Diagnosis (MACD) framework, which allows LLMs to self-learn clinical knowledge via a multi-agent pipeline that summarizes, refines, and applies diagnostic insights. It mirrors how physicians develop expertise through experience, enabling more focused and accurate diagnosis on key disease-specific cues. We further extend it to a MACD-human collaborative workflow, where multiple LLM-based diagnostician agents engage in iterative consultations, supported by an evaluator agent and human oversight for cases where agreement is not reached. Evaluated on 4,390 real-world patient cases across seven diseases using diverse open-source LLMs (Llama-3.1 8B/70B, DeepSeek-R1-Distill-Llama 70B), MACD significantly improves primary diagnostic accuracy, outperforming established clinical guidelines with gains up to 22.3%. In direct comparison with physician-only diagnosis under the same evaluation protocol, MACD achieves comparable or superior performance, with improvements up to 16%. Furthermore, the MACD-human workflow yields an 18.6% improvement over physician-only diagnosis, demonstrating the synergistic potential of human-AI collaboration. Notably, the self-learned clinical knowledge exhibits strong cross-model stability, transferability across LLMs, and capacity for model-specific personalization. This work thus presents a scalable self-learning paradigm that bridges the gap between the intrinsic knowledge of LLMs and real-world clinical practice.
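The escalation logic of the MACD-human workflow, as the abstract describes it, can be sketched as follows. The agent internals, the unanimity rule, and the round limit are illustrative assumptions standing in for the paper's evaluator agent and LLM diagnosticians.

```python
# Minimal sketch of the MACD-human escalation loop: several diagnostician agents
# propose a diagnosis, an evaluator checks agreement, and unresolved cases are
# routed to a human. Agents are stub callables, not real LLM diagnosticians.
from collections import Counter

def consult(case: str, diagnosticians, max_rounds: int = 3):
    """Return (diagnosis, resolved_by), where diagnosticians are callables case -> str."""
    for _ in range(max_rounds):
        votes = Counter(d(case) for d in diagnosticians)
        diagnosis, count = votes.most_common(1)[0]
        if count == len(diagnosticians):   # evaluator: unanimous agreement
            return diagnosis, "agents"
    return diagnosis, "human"              # no consensus: human oversight

# Stub agents standing in for LLM diagnosticians.
agents = [lambda c: "T2DM", lambda c: "T2DM", lambda c: "T2DM"]
print(consult("elevated HbA1c, polyuria", agents))  # ('T2DM', 'agents')
```

The actual framework also accumulates self-learned diagnostic knowledge across cases, which this stub omits.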
- Asia > China > Anhui Province > Hefei (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- North America > United States (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Evaluation of Causal Reasoning for Large Language Models in Contextualized Clinical Scenarios of Laboratory Test Interpretation
Bhasuran, Balu, Prosperi, Mattia, Hanna, Karim, Petrilli, John, Washington, Caretia JeLayne, He, Zhe
This study evaluates causal reasoning in large language models (LLMs) using 99 clinically grounded laboratory test scenarios aligned with Pearl's Ladder of Causation: association, intervention, and counterfactual reasoning. We examined common laboratory tests such as hemoglobin A1c, creatinine, and vitamin D, and paired them with relevant causal factors including age, gender, obesity, and smoking. Two LLMs - GPT-o1 and Llama-3.2-8b-instruct - were tested, with responses evaluated by four medically trained human experts. GPT-o1 demonstrated stronger discriminative performance (AUROC overall = 0.80 +/- 0.12) compared to Llama-3.2-8b-instruct (0.73 +/- 0.15), with higher scores across association (0.75 vs 0.72), intervention (0.84 vs 0.70), and counterfactual reasoning (0.84 vs 0.69). Sensitivity (0.90 vs 0.84) and specificity (0.93 vs 0.80) were also greater for GPT-o1, with reasoning ratings showing similar trends. Both models performed best on intervention questions and worst on counterfactuals, particularly in altered outcome scenarios. These findings suggest GPT-o1 provides more consistent causal reasoning, but refinement is required before adoption in high-stakes clinical applications.
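The AUROC figures reported above have a simple rank-based definition that is worth making concrete. Below is a dependency-free implementation via the Mann-Whitney formulation; the labels and scores in the example are invented, not the study's data.

```python
# Dependency-free AUROC: the probability that a randomly chosen positive
# outranks a randomly chosen negative, with ties counted as 1/2.

def auroc(labels, scores):
    """labels: 0/1 ground truth; scores: model confidence for the positive class."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
print(round(auroc(labels, scores), 3))  # 0.833
```

In the study's setting, the "scores" would come from expert ratings of model answers, which is why the paper can report AUROC with a standard deviation across question categories.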
- North America > United States > Florida > Hillsborough County > Tampa (0.14)
- North America > United States > Florida > Alachua County > Gainesville (0.14)
- North America > United States > Florida > Leon County > Tallahassee (0.04)
- Europe > Finland > Uusimaa > Helsinki (0.04)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)
- Health & Medicine > Diagnostic Medicine > Lab Test (1.00)
- Health & Medicine > Consumer Health (1.00)
Towards Next-Generation Medical Agent: How o1 is Reshaping Decision-Making in Medical Scenarios
Xu, Shaochen, Zhou, Yifan, Liu, Zhengliang, Wu, Zihao, Zhong, Tianyang, Zhao, Huaqin, Li, Yiwei, Jiang, Hanqi, Pan, Yi, Chen, Junhao, Lu, Jin, Zhang, Wei, Zhang, Tuo, Zhang, Lu, Zhu, Dajiang, Li, Xiang, Liu, Wei, Li, Quanzheng, Sikora, Andrea, Zhai, Xiaoming, Xiang, Zhen, Liu, Tianming
Artificial Intelligence (AI) has become essential in modern healthcare, with large language models (LLMs) offering promising advances in clinical decision-making. Traditional model-based approaches, including those leveraging in-context demonstrations and those with specialized medical fine-tuning, have demonstrated strong performance in medical language processing but struggle with real-time adaptability, multi-step reasoning, and handling complex medical tasks. Agent-based AI systems address these limitations by incorporating reasoning traces, tool selection based on context, knowledge retrieval, and both short- and long-term memory. These additional features enable the medical AI agent to handle complex medical scenarios where decision-making should be built on real-time interaction with the environment. Therefore, unlike conventional model-based approaches that treat medical queries as isolated questions, medical AI agents approach them as complex tasks and behave more like human doctors. In this paper, we study the choice of the backbone LLM for medical AI agents, which is the foundation for the agent's overall reasoning and action generation. In particular, we consider the emergent o1 model and examine its impact on agents' reasoning, tool-use adaptability, and real-time information retrieval across diverse clinical scenarios, including high-stakes settings such as intensive care units (ICUs). Our findings demonstrate o1's ability to enhance diagnostic accuracy and consistency, paving the way for smarter, more responsive AI tools that support better patient outcomes and decision-making efficacy in clinical practice.
- North America > United States > Georgia > Clarke County > Athens (0.14)
- North America > United States > Texas > Tarrant County > Arlington (0.14)
- North America > United States > Colorado > Adams County > Aurora (0.14)
- (11 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Lab-AI -- Retrieval-Augmented Language Model for Personalized Lab Test Interpretation in Clinical Medicine
Wang, Xiaoyu, Ouyang, Haoyong, Bhasuran, Balu, Luo, Xiao, Hanna, Karim, Lustria, Mia Liza A., He, Zhe
Accurate interpretation of lab results is crucial in clinical medicine, yet most patient portals use universal normal ranges, ignoring factors like age and gender. This study introduces Lab-AI, an interactive system that offers personalized normal ranges using Retrieval-Augmented Generation (RAG) from credible health sources. Lab-AI has two modules: factor retrieval and normal range retrieval. We tested these on 68 lab tests: 30 with conditional factors and 38 without. For tests with factors, normal ranges depend on patient-specific information. Our results show GPT-4-turbo with RAG achieved a 0.95 F1 score for factor retrieval and 0.993 accuracy for normal range retrieval. GPT-4-turbo with RAG outperformed the best non-RAG system by 29.1% in factor retrieval and showed 60.9% and 52.9% improvements in question-level and lab-level performance, respectively, for normal range retrieval. These findings highlight Lab-AI's potential to enhance patient understanding of lab results.

Introduction: The Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 played a key role in promoting the adoption and meaningful use of electronic health records (EHRs) throughout the U.S. healthcare system. Through the Medicare and Medicaid EHR Incentive Programs, the Act provided financial incentives that facilitated widespread EHR adoption.
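Lab-AI's two-module flow, as described in the abstract, can be sketched with a toy lookup standing in for retrieval. The knowledge base below is invented for illustration (the creatinine ranges are common textbook values), and the dictionary lookup stands in for the paper's RAG pipeline.

```python
# Hypothetical sketch of the two-module flow: first retrieve the factors a
# test's range depends on, then retrieve the range matching the patient's
# factor values. A toy dict stands in for RAG over credible health sources.

KB = {
    "creatinine": {
        "factors": ["gender"],
        "ranges": {("male",): "0.74-1.35 mg/dL", ("female",): "0.59-1.04 mg/dL"},
    },
    "vitamin D": {"factors": [], "ranges": {(): "20-50 ng/mL"}},
}

def retrieve_factors(test: str) -> list:
    """Module 1: which patient-specific factors condition this test's normal range?"""
    return KB[test]["factors"]

def retrieve_range(test: str, patient: dict) -> str:
    """Module 2: look up the range keyed by the patient's factor values."""
    key = tuple(patient[f] for f in retrieve_factors(test))
    return KB[test]["ranges"][key]

print(retrieve_range("creatinine", {"gender": "female"}))  # 0.59-1.04 mg/dL
```

The split into two modules mirrors the paper's evaluation: factor retrieval is scored with F1, and normal range retrieval with accuracy given the correct factors.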
- North America > United States > Florida > Hillsborough County > Tampa (0.14)
- North America > United States > Oklahoma > Payne County > Stillwater (0.04)
- North America > United States > Georgia > Fulton County > Johns Creek (0.04)
- North America > United States > Florida > Leon County > Tallahassee (0.04)
Blood test for male infertility could be on the horizon: AI can screen men with 74% accuracy - with no semen needed
Although the terms are often confused or used interchangeably, sperm and semen are not the same thing. Semen is the fluid that comes out of the penis, while sperm are the microscopic cells within the semen. Sperm cells are specialized for the task of fertilizing an egg. Semen analysis is considered essential for diagnosis of male infertility, but is not readily available at medical institutions other than those specializing in infertility treatment. 'Fertility specialists take it for granted that the first step in diagnosing male infertility is to perform a semen analysis,' Professor Kobayashi added.
ED-Copilot: Reduce Emergency Department Wait Time with Language Model Diagnostic Assistance
Sun, Liwen, Agarwal, Abhineet, Kornblith, Aaron, Yu, Bin, Xiong, Chenyan
In the emergency department (ED), patients undergo triage and multiple laboratory tests before diagnosis. This time-consuming process causes ED crowding which impacts patient mortality, medical errors, staff burnout, etc. This work proposes (time) cost-effective diagnostic assistance that leverages artificial intelligence systems to help ED clinicians make efficient and accurate diagnoses. In collaboration with ED clinicians, we use public patient data to curate MIMIC-ED-Assist, a benchmark for AI systems to suggest laboratory tests that minimize wait time while accurately predicting critical outcomes such as death. With MIMIC-ED-Assist, we develop ED-Copilot which sequentially suggests patient-specific laboratory tests and makes diagnostic predictions. ED-Copilot employs a pre-trained bio-medical language model to encode patient information and uses reinforcement learning to minimize ED wait time and maximize prediction accuracy. On MIMIC-ED-Assist, ED-Copilot improves prediction accuracy over baselines while halving average wait time from four hours to two hours. ED-Copilot can also effectively personalize treatment recommendations based on patient severity, further highlighting its potential as a diagnostic assistant. Since MIMIC-ED-Assist is a retrospective benchmark, ED-Copilot is restricted to recommend only observed tests. We show ED-Copilot achieves competitive performance without this restriction as the maximum allowed time increases. Our code is available at https://github.com/cxcscmu/ED-Copilot.
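The sequential decide-test-or-diagnose loop at the core of ED-Copilot can be sketched as below. The policy stub, test names, per-test wait times, and stopping rule are illustrative assumptions; the paper learns the policy with reinforcement learning over a biomedical language model.

```python
# Sketch of a sequential lab-test suggestion loop: at each step the policy
# either orders another test (adding wait time) or commits to a diagnosis.
# The stub policy below stands in for the paper's learned RL policy.

def triage(patient: dict, policy, budget_minutes: int = 240):
    """Suggest tests one at a time until the policy stops or time runs out."""
    elapsed, ordered = 0, []
    while elapsed < budget_minutes:
        action = policy(patient, ordered)          # next test, or a diagnosis
        if action["kind"] == "diagnose":
            return ordered, action["label"], elapsed
        ordered.append(action["test"])
        elapsed += action["minutes"]               # each test adds wait time
    return ordered, "undetermined", elapsed

def stub_policy(patient, ordered):
    # Invented rule: order a troponin once, then commit.
    if "troponin" not in ordered:
        return {"kind": "test", "test": "troponin", "minutes": 60}
    return {"kind": "diagnose", "label": "low risk"}

print(triage({"chief_complaint": "chest pain"}, stub_policy))
```

The RL objective in the paper trades off exactly these two quantities: the accumulated wait time against the accuracy of the final outcome prediction, which is how it halves average wait time without losing accuracy.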
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > California > Alameda County > Berkeley (0.04)
- (3 more...)
- Health & Medicine > Diagnostic Medicine > Lab Test (0.76)
- Health & Medicine > Therapeutic Area > Immunology (0.68)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.72)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Enabling Patient-side Disease Prediction via the Integration of Patient Narratives
Su, Zhixiang, Zhang, Yinan, Jing, Jiazheng, Xiao, Jie, Shen, Zhiqi
Disease prediction holds considerable significance in modern healthcare, because of its crucial role in facilitating early intervention and implementing effective prevention measures. However, most recent disease prediction approaches heavily rely on laboratory test outcomes (e.g., blood tests and medical imaging from X-rays). Gaining access to such data for precise disease prediction is often a complex task from the standpoint of a patient and is always only available post-patient consultation. To make disease prediction available from patient-side, we propose Personalized Medical Disease Prediction (PoMP), which predicts diseases using patient health narratives including textual descriptions and demographic information. By applying PoMP, patients can gain a clearer comprehension of their conditions, empowering them to directly seek appropriate medical specialists and thereby reducing the time spent navigating healthcare communication to locate suitable doctors. We conducted extensive experiments using real-world data from Haodf to showcase the effectiveness of PoMP.
- Asia > Singapore > Central Region > Singapore (0.06)
- Asia > China (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- Health & Medicine > Diagnostic Medicine > Imaging (0.66)
- Health & Medicine > Diagnostic Medicine > Lab Test (0.54)
- Health & Medicine > Therapeutic Area > Oncology (0.46)
Large Language Multimodal Models for 5-Year Chronic Disease Cohort Prediction Using EHR Data
Ding, Jun-En, Thao, Phan Nguyen Minh, Peng, Wen-Chih, Wang, Jian-Zhe, Chug, Chun-Cheng, Hsieh, Min-Chen, Tseng, Yun-Chien, Chen, Ling, Luo, Dongsheng, Wang, Chi-Te, Chen, Pei-fu, Liu, Feng, Hung, Fang-Ming
Chronic diseases such as diabetes are the leading causes of morbidity and mortality worldwide. Numerous research studies have attempted diagnosis with various deep learning models. However, most previous studies had certain limitations, including the use of publicly available datasets (e.g., MIMIC) and imbalanced data. In this study, we collected five-year electronic health records (EHRs) from a Taiwan hospital database, including 1,420,596 clinical notes, 387,392 laboratory test results, and more than 1,505 laboratory test items, for research on pre-training large language models. We proposed a novel Large Language Multimodal Models (LLMMs) framework incorporating multimodal data from clinical notes and laboratory test results for the prediction of chronic disease risk. Our method combined a text embedding encoder and a multi-head attention layer to learn laboratory test values, utilizing a deep neural network (DNN) module to merge blood features with chronic disease semantics into a latent space. In our experiments, we observe that ClinicalBERT and PubMedBERT, when combined with attention fusion, can achieve an accuracy of 73% in multiclass chronic disease and diabetes prediction. By transforming laboratory test values into textual descriptions and employing the Flan-T5 model, we achieved a 76% Area Under the ROC Curve (AUROC), demonstrating the effectiveness of leveraging numerical text data for training and inference in language models. This approach significantly improves the accuracy of early-stage diabetes prediction.
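The abstract's key trick, serializing numeric lab values into text a language model can read, can be sketched as below. The sentence template and the reference thresholds are illustrative assumptions, not the paper's exact scheme.

```python
# Hypothetical serialization of numeric lab results into flagged text sentences,
# the kind of input a text-only language model can consume directly.

REFERENCE = {"HbA1c": (4.0, 5.6, "%"), "glucose": (70, 99, "mg/dL")}  # toy ranges

def labs_to_text(labs: dict) -> str:
    """Render {test: value} as sentences with a high/normal/low flag."""
    parts = []
    for test, value in labs.items():
        low, high, unit = REFERENCE[test]
        flag = "high" if value > high else "low" if value < low else "normal"
        parts.append(f"{test} is {value} {unit} ({flag})")
    return "; ".join(parts) + "."

print(labs_to_text({"HbA1c": 7.2, "glucose": 130}))
# HbA1c is 7.2 % (high); glucose is 130 mg/dL (high).
```

Turning values into flagged sentences lets a pretrained text model reuse its language understanding on numeric data, which is the mechanism behind the reported Flan-T5 gains.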
- Asia > Taiwan > Taiwan > Taipei (0.05)
- North America > United States > New York (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (2 more...)
- Health & Medicine > Health Care Technology > Medical Record (1.00)
- Health & Medicine > Diagnostic Medicine > Lab Test (1.00)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.95)
Explainable Machine Learning for ICU Readmission Prediction
de Sá, Alex G. C., Gould, Daniel, Fedyukova, Anna, Nicholas, Mitchell, Dockrell, Lucy, Fletcher, Calvin, Pilcher, David, Capurro, Daniel, Ascher, David B., El-Khawas, Khaled, Pires, Douglas E. V.
The intensive care unit (ICU) comprises a complex hospital environment, where decisions made by clinicians carry a high level of risk for patients' lives. A comprehensive care pathway must then be followed to reduce complications. Uncertain, competing and unplanned aspects within this environment increase the difficulty of uniformly implementing the care pathway. Readmission contributes to this pathway's difficulty, occurring when patients are admitted again to the ICU in a short timeframe, resulting in high mortality rates and high resource utilisation. Several works have tried to predict readmission from patients' medical information. Although they have some level of success in predicting readmission, those works do not properly assess, characterise and understand readmission prediction. This work proposes a standardised and explainable machine learning pipeline to model patient readmission on a multicentric database (i.e., the eICU cohort with 166,355 patients, 200,859 admissions and 6,021 readmissions) while validating it in monocentric (i.e., the MIMIC IV cohort with 382,278 patients, 523,740 admissions and 5,984 readmissions) and multicentric settings. Our machine learning pipeline achieved predictive performance in terms of the area under the receiver operating characteristic curve (AUC) of up to 0.7 with a Random Forest classification model, yielding overall good calibration and consistency on validation sets. From the explanations provided by the constructed models, we could also derive a set of insightful conclusions, primarily on variables related to vital signs and blood tests (e.g., albumin, blood urea nitrogen and hemoglobin levels), demographics (e.g., age, and admission height and weight), and ICU-associated variables (e.g., unit type). These insights provide an invaluable source of information during clinicians' decision-making when discharging ICU patients.
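One common way to extract the kind of variable-level insights the abstract mentions is permutation importance: shuffle one feature and measure how much performance drops. The sketch below shows the idea with a toy scoring rule in place of the paper's Random Forest; the feature names, data, and threshold are invented.

```python
# Sketch of permutation importance: the drop in a metric after shuffling one
# feature column estimates that feature's contribution to the model.
import random

def permutation_importance(model, X, y, feature_idx, metric, seed=0):
    """model: row -> prediction; metric: (y_true, y_pred) -> score."""
    rng = random.Random(seed)
    base = metric(y, [model(row) for row in X])
    shuffled_col = [row[feature_idx] for row in X]
    rng.shuffle(shuffled_col)
    X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
              for row, v in zip(X, shuffled_col)]
    return base - metric(y, [model(row) for row in X_perm])

accuracy = lambda y_true, y_pred: sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy "model": flags readmission when albumin (feature 0) is low; feature 1 is unused.
model = lambda row: int(row[0] < 3.5)
X = [[3.0, 80], [4.1, 60], [2.9, 75], [4.5, 70]]
y = [1, 0, 1, 0]
print(permutation_importance(model, X, y, 0, accuracy))   # albumin matters
print(permutation_importance(model, X, y, 1, accuracy))   # 0.0 (unused feature)
```

An unused feature always scores exactly zero here, which is the sanity check that makes this kind of explanation trustworthy when ranking vital signs and blood test variables.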
- North America > United States (0.14)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Oceania > New Zealand (0.04)
- (4 more...)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Health Care Providers & Services (1.00)
- Health & Medicine > Diagnostic Medicine > Lab Test (0.34)